Hadoop Distributions
Several vendors offer their own versions of Hadoop. This tip covers the following questions:
- What are the Hadoop distributions?
- How are the Hadoop distributions different from one another?
- Which distribution should I choose?
What are the Hadoop distributions?
Hadoop distributions available in the market are built on top of the open source Apache Hadoop framework. The core of every distribution is that same framework, which is developed under the Apache Software Foundation and is still distributed as open source.
Hadoop is a complex ecosystem made up of many components and sub-projects, each with its own release cycle. A distribution integrates compatible versions of these projects and manages their dependencies so that enterprises can focus on the real problem at hand. Each distribution aims to ship a stable version of Hadoop with the necessary patches applied, often alongside the vendor's own proprietary components.
How are the Hadoop distributions different from one another?
Distributions differ mainly in the components they add on top of the core Hadoop engine. These components are meant to make deployment, management, and maintenance of Hadoop (and of other projects within the Hadoop ecosystem) simpler, faster, and more efficient. In most cases, some of these additional components are proprietary to a particular distribution and are provided at a cost.
Which distribution should I choose?
Every distribution has its own pros and cons, so the answer is: it depends. Aspects to consider include cost, simplicity, performance, support, documentation, and the availability of cloud and on-premises versions. A detailed comparison of the various distributions is out of scope for this tip.
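One informal way to structure this decision is a weighted scoring matrix over the aspects listed above. The following Python sketch is only an illustration: the weights, the 1-to-5 ratings, and the names distribution_a and distribution_b are hypothetical placeholders, not measurements of any real product.

```python
# Hypothetical weights in percent (they sum to 100); adjust to your priorities.
WEIGHTS = {
    "cost": 30,
    "simplicity": 20,
    "performance": 20,
    "support": 20,
    "documentation": 10,
}


def weighted_score(ratings, weights=WEIGHTS):
    """Return the sum of weight * rating over all criteria."""
    missing = set(weights) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")
    return sum(weights[criterion] * ratings[criterion] for criterion in weights)


def rank(candidates, weights=WEIGHTS):
    """Return candidate names ordered from highest to lowest score."""
    return sorted(
        candidates,
        key=lambda name: weighted_score(candidates[name], weights),
        reverse=True,
    )


# Placeholder ratings on a 1 (poor) to 5 (excellent) scale, for illustration only.
candidates = {
    "distribution_a": {"cost": 4, "simplicity": 3, "performance": 3,
                       "support": 2, "documentation": 4},
    "distribution_b": {"cost": 2, "simplicity": 4, "performance": 4,
                       "support": 5, "documentation": 3},
}

for name in rank(candidates):
    print(name, weighted_score(candidates[name]))
```

The ranking is only as good as the weights and ratings you supply, so treat the output as a starting point for discussion rather than a verdict.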
Let's take a look at some of the popular Hadoop distributions available in the market to help guide your decision.